METHOD AND APPARATUS FOR SYNCHRONIZED EXECUTION OF AN APPLICATION IN A HIGH-AVAILABILITY ENVIRONMENT
Patent Abstract:
The present invention essentially relates to a method of synchronized execution of an application in a high-availability environment comprising a plurality of calculation modules interconnected by a very high speed network, characterized in that it comprises the following steps: configuring (2000) the plurality of modules into a plurality of partitions including at least two execution partitions, one primary and the other secondary, and a control partition; execution (2100) of the application on each execution partition, the input-output processed by the main partition being transmitted to the secondary execution partition via the control partition; synchronization (2200) of the executions by exploiting the microprocessor context changes; transmission (2300) of a catastrophic error signal to the control partition; continuation (2500) of the execution by switching to a degraded mode, the execution continuing on a single partition.
Publication number: FR3023398A1 Application number: FR1456274 Filing date: 2014-07-01 Publication date: 2016-01-08 Inventor: Georges Lecourtier Applicant: Bull SA; IPC main class:
Patent Description:
[0001] FIELD OF THE INVENTION The present invention relates to a method and a device for synchronized execution of an application in a high-availability environment. The present invention essentially aims to improve the resilience of the environment. Resilience is defined as the ability of a system to survive a number of failures and to allow its repair without causing a service interruption. The field of the invention is, in general, that of information processing and communication systems, and more particularly that of computer production systems using high-availability servers. BACKGROUND OF THE INVENTION Known computer systems are all subject to hardware or software failures that affect their general operation at random times. When one of these systems manages functions that are critical for the security of goods or people, the behavior of the electronic system in the presence of a failure becomes a determining element of the overall reliability perceived by the users. This behavior defines the resilience class of the system. It is totally dependent on the technical choices made at the time of the design of the system, because it relies on hardware redundancies that always involve a certain cost. The resilience class is therefore the result of a compromise between minimizing cost and maximizing availability. Several types of solutions have been developed to best meet the requirements of resilience in terms of hardware. Four classes of hardware component failures and three classes of software component failures can be taken as significant examples of events affecting the operation of a computer. [0002] The four classes of hardware components examined are memories, input-output circuits, power supplies and cooling components, and processors.
[0003] Memories Memories are among the components whose raw reliability is the worst, because the very high miniaturization of memory points makes them very sensitive to manufacturing defects and to disturbances of various origins affecting their environment. Fortunately, these problems have been studied for a very long time, and a whole range of measures, such as error-correcting codes, aims at enabling memories to offer an acceptable operational reliability. Inputs/Outputs I/O failures are characterized by one or more errors detected in a protocol or in a data packet exchanged between two systems. The vast majority of these errors originate from the electromagnetic environment of the transmission (noise, ...), and the problem is solved by retrying the input-output transaction. In cases where the link is physically interrupted by a hardware circuit failure (cable, connector, laser, ...), resilience is usually provided by a redundant input-output channel that offers an alternative path for the data, whether this path is direct between the two systems or relayed through a computer network. [0004] Power Supplies and Cooling Components In a high-availability system, power supplies and fans or hydraulic pumps are always redundant. The failure of any one of these components does not affect user applications. The repair is performed by a hot exchange of the power supply or the cooling device. In contrast, the lowest-level power supplies, also known as Points Of Load (POL), which directly supply power to processors, memories, or input-output circuits, are usually not redundant and are soldered to the motherboard for cost, size, and performance reasons. The failure of these types of components is therefore fatal for the functional components that are directly attached to them. POL failures will therefore be assimilated to failures of the functional components themselves.
Processors Processors are in a class of their own because their failures have multiple origins and need to be explained. [0005] Since the processors of critical systems generally support a multi-core architecture, we will consider this hypothesis. To ensure the resilience of the processors, hardware redundancy is again used. Here, the hardware is the direct support for running system software and applications. Duplicating an application processor creates the need to synchronize the foreground processor, the one that interacts with the users, and the second-level processor(s), which will replace the foreground processor on the fly when it falls victim to a failure. In the prior art, and for fairly simple processors, one solution consisted of tripling the calculation functions on the same card or in the same integrated circuit and adding a circuit comparing the results based on the states of the common input-output and memory buses. The width of these buses could reach 128 bits of data and 32 bits of addressing. At the scale of each machine cycle, typically between 1 and 10 ns, the comparison circuit delivered a consistency status based on a majority voting principle: the correct state is the one present in at least two of the three processors. As soon as an error was detected in the foreground processor, or the comparison circuit indicated that the bus of this processor differed from the other two buses, the foreground processor was stopped and the execution continued on the first secondary processor. Possible failures of second-level processors were reported and led to a switch to degraded mode, waiting for the exchange of the processor card. As the exchange of this card involved stopping the service, it was first necessary to move the applications to an equivalent backup system which, in principle, should not have any hardware element in common with the main system.
The duration of this recovery could seriously affect the availability of the service if the volume of information to be moved was large. To overcome this constraint, it was necessary to provide three separate processor cards in the same cabinet or the same chassis, but unfortunately the results-comparison circuit very quickly led to a performance limitation because of the limited propagation speed of electrical or optical signals. This mode is therefore both expensive and inefficient, especially when considering the implementation of standard multi-core sockets that do not expose bus-state comparison functions outside the socket. [0006] Thus, in the prior art, the solutions to the problem of high availability do not cover the failures of the processors themselves without significant performance degradation. The three classes of software components that cause reliability issues are the critical sections of hypervisors and operating systems, the defects of I/O drivers and firmware (i.e., software embedded in the hardware), and the defects of the applications. Critical Sections A critical resource is a collection of data or executable instructions protected by a software lock. A critical section is the sequence of instructions that an execution core uses to access a critical resource. It is well known that the software lock that protects against the simultaneous execution of the same critical section by more than one core uses a special instruction of the "Test-And-Set" type, implemented by a specific hardware device in the processors and the cache controllers. At the exit of a critical section, the process running on the execution core in question must release the critical resource by returning the software lock to an open state.
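The lock discipline just described can be sketched in a few lines. This is an illustrative sketch, not part of the patent: Python offers no user-level test-and-set instruction, so a mutex stands in for the hardware atomicity, and all names (`TestAndSetLock`, `worker`) are hypothetical.

```python
import threading

class TestAndSetLock:
    """Minimal sketch of a software lock built on an atomic test-and-set."""

    def __init__(self):
        self._flag = False                 # False = open state
        self._atomic = threading.Lock()   # stands in for the hardware atomicity

    def test_and_set(self):
        """Atomically read the flag and set it; return the previous value."""
        with self._atomic:
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self.test_and_set():         # spin while the lock is held
            pass

    def release(self):
        self._flag = False                 # return the lock to the open state

# A critical section protected by the lock: concurrent increments stay correct.
lock, counter = TestAndSetLock(), [0]

def worker():
    for _ in range(1000):
        lock.acquire()
        counter[0] += 1                    # the critical resource
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter[0] == 4000
```

Forgetting the final `release` is exactly the erroneous management of a critical resource that produces the deadlocks discussed next.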
It is common in multitasking programs running on symmetric multiprocessor systems, also called SMP for "Symmetric MultiProcessing" in English, with several cores sharing the same memory, to create fatal embraces, called deadlocks in English, because of erroneous management of critical resources. Operating systems are thankfully largely free of these deadlocks thanks to intensive testing, but user applications are often affected by such problems. It is then up to the operating system to be able to release the resources blocked by a user deadlock by preempting the core and suspending the execution of the faulty process. However, if a hypervisor creates such a deadlock by itself, no software process can unblock it, and such a defect is comparable to a hardware failure that can affect user data. Input-output drivers and firmware Input-output drivers and firmware are characterized by their close contact with the hardware and their high level of optimization with regard to real-time performance. Each evolution of the hardware may require an update of this software when it is not hard-coded in the electronic cards, but since defect situations appear only in very rare conditions, it is common for a firmware or a driver to be executed in an outdated version. Application Defects Applications managed by an operating system, also called OS for "Operating System" in English, are subject to the procedures of this OS and are usually stopped ("killed") if they end up deadlocked while using 100% of the power of a processor core. Other defects, those which do not block a critical resource of the system, correspond to application logic errors affecting the accuracy of the results and do not concern the present invention. [0007] If a hypervisor is present in the system, it acts as a "super-OS", and the OS or OSes running under this hypervisor, called guest OSes, are isolated from the hardware and also from each other.
In this case we can say that a guest OS is similar to an application under an OS. In particular, the hypervisor is able to kill (we speak of killing a process), to restart ("reset") or to move a complete OS with its applications from one system to another through network resources. The transfer time, also called migration time, depends of course on the bandwidth of this network. [0008] To avoid burdening the presentation, software failures will not be tackled extensively, although they may be a major cause of failure. The repair of these software failures is based on the installation of new versions of software or firmware. Normally, if the system has been well designed, these installations can be performed remotely, without physical intervention on the equipment. The rare case where a software failure requires direct access is that of a defect in the boot firmware, also called bootstrap firmware in English, i.e. in the code used to update the other software or firmware. It is generally handled by a double code version: a so-called factory version and a so-called user version. The only version that follows the maintenance changes is the user version. The factory version is write-protected and therefore makes it possible to regenerate the system configuration at any time. There therefore remain the failures of the bootstrap memory itself, which are among the cases treated in the analysis that follows. [0009] GENERAL DESCRIPTION OF THE INVENTION The purpose of the invention is in particular to remedy the problems that have just been described. Indeed, the invention aims to minimize the downtime of the servers.
For this purpose, the invention proposes a method of synchronized execution of an application in a high-availability environment comprising a plurality of calculation modules interconnected by a very high speed network, characterized in that it comprises the following steps: - configuring the plurality of modules into a plurality of partitions including at least: o two execution partitions having nodes identical in number and characteristics, a main execution partition and a secondary execution partition; o a control partition comprising at least one module; - execution of the application on each execution partition, the input-output processed by the main partition being transmitted to the secondary execution partition via the control partition, which logs these I/O as transactions distributed between the main execution partition and the secondary execution partition; - synchronization of the executions on the main partition and on the secondary partition by the numbering of the microprocessor context changes, a time difference, corresponding to the difference between the current context numbers on the main and secondary execution partitions, having to remain less than a given value; - transmission of a catastrophic error signal to the control partition, said signal being characteristic of a failure in a module of an execution partition; - continuation of the execution by switching to a mode called degraded mode, the execution continuing on a single partition; in the event of a failure of a module of the main execution partition, a failover operation takes place and the secondary execution partition becomes the new main execution partition.
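The sequence of claimed steps can be sketched as a small state machine. This is an illustrative sketch only; the class and method names are hypothetical and do not come from the patent.

```python
from enum import Enum, auto

class Partition(Enum):
    PRIMARY = auto()
    SECONDARY = auto()
    CONTROL = auto()

class HAEnvironment:
    """Illustrative sketch of steps 2000-2500 of the claimed method."""

    def __init__(self):
        self.degraded = False
        self.active = Partition.PRIMARY

    def configure(self):                      # step 2000: partitioning
        self.partitions = {Partition.PRIMARY, Partition.SECONDARY, Partition.CONTROL}

    def forward_io(self, io):                 # step 2100: I/O relayed via control
        # the control partition logs the I/O as a distributed transaction
        return {"logged_by": Partition.CONTROL,
                "replayed_on": Partition.SECONDARY, "io": io}

    def on_catastrophic_error(self, failed):  # steps 2300-2500
        if failed is Partition.PRIMARY:       # failover: secondary promoted
            self.active = Partition.SECONDARY
        self.degraded = True                  # execution continues on one partition

env = HAEnvironment()
env.configure()
env.on_catastrophic_error(Partition.PRIMARY)
assert env.active is Partition.SECONDARY and env.degraded
```

The key design point visible even in this sketch is that the control partition sits on every I/O path, so either execution partition can fail without losing the transaction log.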
The method according to the invention can comprise, in addition to the main steps which have just been mentioned in the preceding paragraph, one or more additional characteristics among the following: - an interruption is forced at a predetermined frequency, to force a microprocessor context change and thus proceed to a synchronization; - the predetermined frequency is of the order of 20 Hz, the desynchronization being thus limited to 50 ms; - the application is virtualized via a hypervisor; - the virtual machines are identical on the main execution partition and on the secondary execution partition, the virtual machines running on a single processor core; - the step of configuring the plurality of modules into a plurality of partitions takes place dynamically, without stopping the service of the environment; - the apparent unavailability of the environment is equal to the desynchronization between the executions of the application on the primary and secondary partitions, this unavailability being limited by the synchronization of the executions. [0010] In addition, the invention proposes a digital storage device comprising a file corresponding to instruction codes implementing the method according to the invention. The invention also proposes a device implementing the method according to the invention. The invention also proposes a device implementing the method according to the invention and comprising a possible combination of the following characteristics: - the device comprises a plurality of modules inserted in a chassis, characterized in that the base of a module comprises insertion guides fixed to the base, these guides being of two types: a first type, to ensure final guidance and protection of a connector of the base, and a second type, having a cylindrical shape terminated by a cone, to provide initial guidance;
- each guide having a corresponding means in the chassis; - the first type has a cylindrical shape terminated by a cone, its length being less than that of the second type by the order of one centimeter; - the first type is a recessed piece capable of cooperating with a cylindrical shape terminated by a cone; - at least one guide of the second type is rigidly fixed to the base; - at least one guide of the first type is fixed in a floating manner to the base; - at least one cylindrical guide cooperating with the guide of the first type is fixed in a floating manner to the chassis; - the base comprises four guides of the first type and two guides of the second type; - the guides are distributed symmetrically with respect to the symmetry axes of the base. The invention and its various applications will be better understood by reading the following description and examining the accompanying figures. DESCRIPTION OF THE FIGURES These are presented only as an indication and in no way limit the invention. The figures show: - in FIG. 1, a logical architecture allowing the implementation of the method according to the invention; - in FIG. 2, a representation of different steps of the method according to the invention; - in FIG. 3, an illustration of an element of a device embodying the invention; - in FIG. 4, an illustration of a connector block floating on the module side; - in FIG. 5, an illustration of an implementation of the invention floating on the chassis side. DETAILED DESCRIPTION OF AN EXAMPLE ILLUSTRATING THE INVENTION FIG. 1 shows an example of a logical architecture that makes it possible to implement the method according to the invention. A mode of materialization of this virtual architecture on a known hardware architecture will be described later. FIG. 1 shows two identical partitions, a first partition A and a second partition B, each comprising a hypervisor H and a virtual machine VM.
A hypervisor, also called a virtualization engine, is a set of instruction codes that emulates a hardware architecture for a virtual machine. Among the best known are KVM, HyperV and VMWare. [0011] A virtual machine, called VM for "Virtual Machine" in English, is mainly composed of a configuration file that describes the resources (microprocessor, disk, input/output devices, ...) that must be virtualized by a hypervisor. In addition, each of the partitions A and B comprises storage means M in which are stored instruction codes corresponding to a hypervisor according to the invention and files corresponding to virtual machines (configuration, state, notably the state of the processor, virtual disk, ...). FIG. 1 shows input/output peripherals I/O connected to the first partition A via the hypervisor H_A1 of the first partition A. Since the partitions A and B are identical, the input/output peripherals can also be connected to the second partition B via the hypervisor H_B1 of the second partition. This possibility is a potential connection that is established in certain circumstances explained below. [0012] One of these two virtual machines of the execution partitions is defined, for example in the configuration file, as a mirror of the other. In our example it is the secondary partition which is considered the mirror. FIG. 1 also shows a third partition C comprising a virtual machine VM_C1 and storage means M_C. The virtual machine VM_C1 of the third partition C logs the input-output of the first partition A in a file and transmits it to the second partition B. In addition, the virtual machine VM_C1 of the third partition traces, with numbers, the context changes (CdC in FIG. 1) of the hypervisor H_A1 of the first partition A and of the hypervisor H_B1 of the second partition in order to synchronize the virtual machines.
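The role of VM_C1 as a numberer of context changes can be pictured with a minimal tracer. This is an illustrative sketch under assumed names (`ContextChangeTracer`, the keys "H_A1" and "H_B1"), not an implementation from the patent.

```python
class ContextChangeTracer:
    """Sketch of VM_C1's tracing role: assign a running number to each
    context change reported by the hypervisors H_A1 and H_B1, so the two
    executions can later be compared by their current numbers."""

    def __init__(self):
        self.counters = {"H_A1": 0, "H_B1": 0}

    def on_context_change(self, hypervisor):
        self.counters[hypervisor] += 1
        return self.counters[hypervisor]      # the number assigned to this CdC

    def lag(self):
        """Difference between the two sequences of event numbers."""
        return abs(self.counters["H_A1"] - self.counters["H_B1"])

trace = ContextChangeTracer()
for _ in range(5):
    trace.on_context_change("H_A1")
trace.on_context_change("H_B1")
assert trace.lag() == 4                        # the mirror is four CdC behind
```

The two counters are two sequences numbering the same sequence of events, which is exactly what allows the lag of the mirror to be measured without a shared wall clock.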
In a preferred embodiment of the invention, before the launch of the virtual machines, the virtual machine of the second partition is obtained by copying the virtual machine of the first partition, which guarantees that the virtual machines are synchronized at startup. When an action is attributed to a device, hardware or virtual, it is in fact carried out by a microprocessor of the device controlled by instruction codes stored in a memory of the device. If an action is attributed to an application, it is actually performed by a microprocessor of the device in a memory of which the instruction codes corresponding to the application are recorded. Figure 2 shows steps of the method according to the invention. In particular, FIG. 2 shows a step 2000 of configuring a plurality of modules into a plurality of partitions including at least: - two execution partitions comprising nodes that are identical in number and in characteristics, a main execution partition and a secondary execution partition; in our example, the main execution partition is the first partition A and the secondary execution partition is the second partition B; - a control partition comprising at least one module; in our example, the control partition is the third partition C. In one embodiment, a hypervisor and a virtual machine on which the application to be executed is installed are installed on each execution partition. [0013] In a step 2100, following the configuration step 2000, the same application is executed on the main execution partition A and on the secondary execution partition B. This is an execution step. Input-outputs processed by the main execution partition A are transmitted to the secondary execution partition B via the control partition C, which logs these inputs-outputs as transactions distributed between the main execution partition A and the secondary execution partition B.
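The logging of an I/O as a transaction spanning both execution partitions can be sketched as follows. This is an illustrative sketch; the function name and the log format are assumptions, not taken from the patent.

```python
def replicate_io(io_op, apply_primary, apply_secondary, log):
    """Sketch: the control partition logs an I/O as a transaction that is
    validated only if the operation succeeds on both execution partitions."""
    log.append(("begin", io_op))
    ok_a = apply_primary(io_op)        # carried out on partition A
    ok_b = apply_secondary(io_op)      # replayed on partition B
    if ok_a and ok_b:
        log.append(("commit", io_op))
        return True
    log.append(("abort", io_op))       # one side failed: do not validate
    return False

log = []
assert replicate_io("read:block42", lambda op: True, lambda op: True, log)
assert log[-1] == ("commit", "read:block42")
assert not replicate_io("read:block43", lambda op: True, lambda op: False, log)
assert log[-1] == ("abort", "read:block43")
```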
The transaction is distributed because it is validated only if the operations have been correctly carried out on both partitions A and B. A synchronization step 2200 ensures synchronization of the executions between the main partition A and the secondary partition B by means of the numbering of the microprocessor context changes made by the control partition C. A time difference, corresponding to the difference between the current context numbers on the main execution partition A and the secondary execution partition B, must be kept below a predetermined value, for example 50 ms. In a variant, this difference is estimated, for example, by the difference between context change numbers. In this case it is not really a time difference but a difference between numbers belonging to two sequences representing the same sequence of events, an event being here a context change. The execution and synchronization steps are executed in parallel, i.e. simultaneously. At each context change, for example, the hypervisor of the first partition checks the progress of the execution by the mirror virtual machine. If too large a deviation is found, that is to say greater than 50 ms in our example, then the hypervisor of the first partition waits before finalizing the context change, that is to say before allowing the virtual machine of the first partition to continue its execution. In order to guarantee a maximum time interval, at least one context change is made at the desired frequency. In our example we force a hardware interrupt every 50 ms. Such an interruption causes a context change, which ensures that the means preventing too great a difference between the mirror executions are indeed implemented. When a failure in a module of an execution partition is detected, a step 2300 of transmitting a catastrophic error signal to the control partition C occurs. Two cases then arise depending on whether the failure occurs in: - the main execution partition A (FIG.
2310), in which case a failover step 2400 takes place, the secondary execution partition B becoming the new main execution partition, then the method proceeds to a step 2500 of continuation of the execution; - the secondary execution partition B (FIG. 2320), in which case the method goes directly to the step 2500 of continuation of the execution. [0014] This method according to the invention makes it possible to increase the resilience of a high-availability device. A high-availability device for implementing the invention is described hereinafter. A hardware architecture for the implementation of the invention is, for example, a system comprising eight modules, each of the modules being connected to all the others; we speak of an "all-to-all" topology. Other topologies can be used as long as they offer several paths between the different modules. This number of modules is given as an example and can vary from two to several tens. [0015] Each interconnection port of each of the modules must therefore support seven high-speed links, each link comprising eight bidirectional channels. If each channel operates at 14 Gb/s, the bandwidth between any two modules in the system is 14 x 8 = 112 Gb/s. This frequency is given as an example and can vary in significant proportions, up to the technological limitations imposed by connectors and cables. Each module supports two sockets connected locally by a high-speed bus traced on their motherboard. This arrangement makes it possible to reconfigure a module dynamically to cope with certain failures of sockets or memories. It will be noted for what follows that the previous types of failures are rarely sudden failures, such as power failures, and are preceded for a fairly long period of time by the rise of error signals indicating that the reliability of at least one hardware component is degrading. Proper management of these error signals is critical to achieving good resilience performance.
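The synchronization gate of step 2200 described above, with its 50 ms bound and the forced interrupt at 20 Hz, can be sketched as follows. This is an illustrative sketch; the class name and the counter-based drift estimate are assumptions consistent with the variant that compares context change numbers.

```python
MAX_DRIFT = 50e-3     # maximum allowed desynchronization: 50 ms
TICK_PERIOD = 50e-3   # a forced interrupt every 50 ms (20 Hz) guarantees a CdC

class SyncMonitor:
    """Sketch: compare the current context-change numbers of the two
    execution partitions and decide whether a context change may finalize."""

    def __init__(self):
        self.primary_cdc = 0    # current context number, main partition
        self.secondary_cdc = 0  # current context number, secondary partition

    def record(self, partition, number):
        if partition == "primary":
            self.primary_cdc = number
        else:
            self.secondary_cdc = number

    def may_finalize_context_change(self):
        # Each counter increment corresponds to at most one tick period,
        # so the drift in seconds is bounded by the counter difference.
        drift = abs(self.primary_cdc - self.secondary_cdc) * TICK_PERIOD
        return drift <= MAX_DRIFT

mon = SyncMonitor()
mon.record("primary", 10)
mon.record("secondary", 9)
assert mon.may_finalize_context_change()      # one tick behind: within 50 ms
mon.record("primary", 12)
assert not mon.may_finalize_context_change()  # three ticks behind: primary waits
```

When the gate returns false, the hypervisor of the main partition simply delays the finalization of its context change until the mirror catches up, which is what bounds the apparent unavailability of the environment.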
These dual-socket cards are equipped with high-speed connectors at the interface with the interconnection module, and all electrical circuits connected to these connectors are capable of supporting hot-plug operations. This implies that the insertion or removal of a module in the cluster is carried out without disturbing the operation of the other modules of this cluster. In particular, the connection of a non-powered module to the interconnection module cables must not disturb the various switches which operate on the other seven dual-socket cards. In addition, each dual-socket card supports an embedded microcontroller, called BMC for "Baseboard Management Controller" in English, to manage the server's very low level functions such as power-on/power-off, the monitoring of temperatures and voltages, or the dynamic reconfiguration of the cluster of modules. Reconfiguration Interface To assist in the automatic configuration or reconfiguration of the multi-modular system, an Ethernet switch, called the MSM for "Management Switch Module" in English, is incorporated in the interconnection module. This switch may consist of two switches of five ports each, four downlinks and one uplink, the ports of the uplinks being connected by a circuit trace. This embodiment further improves the resilience of the interconnection module by limiting the number of cases in which the BMCs lose all of their means of dialogue. In association with this Ethernet network private to the BMCs, an auxiliary device of the interconnection module makes it possible to allocate to each BMC an identifier from which it will be able to calculate its IP address. Indeed, each BMC must be able to start its IP connection on this network with an address different from that of its neighbors.
This subsidiary device, based on a small FPGA, makes it possible to transmit to each BMC, at the time of initialization, a port number (a 3-bit identifier) specific to the physical location of each module relative to the interconnection module. From these 3 bits, each BMC completes its IP address before connecting to its neighbors. It should be noted that, since the subnetwork between the BMCs and the MSM switch is entirely private, the 29 high-order bits of the IP address can be identical for each module. [0016] To ensure good resilience to power failures, each module has at least two AC-DC converters, also called PSUs for "Power Supply Unit" in English, delivering power on the rail, generally at 12V, at the input of each dual-socket card. The PSUs can be connected to two independent AC networks to ensure system operation if one of these networks fails. Similarly, the MSM module is powered by several servers: each powered module sends 12V current through its Ethernet link to the MSM. These different currents are summed by a diode switch located between the 12V inputs and the POLs of the circuits of the Ethernet switch and of the reconfiguration FPGA. [0017] Synchronization Interface Another constituent element of the invention is a network of synchronization clocks. A subnetwork of the interconnection module connects in all-to-all fashion the FPGAs of all the modules to constitute a quasi-perfect synchronization system between the processors of the different modules. The FPGA logic allows, at initialization or reconfiguration, the clock network to be partitioned into as many clock subnetworks as there are logical partitions. In each clock subnetwork, the master processor, usually the processor that contains the startup logic, distributes its clock to all the processors belonging to its partition. [0018] The FPGA isolates each logical partition so that a partition receives one and only one clock. This clock is a square signal whose frequency is a few megahertz, for example 25 MHz.
This signal may advantageously be filtered by a PLL circuit at the input of each module to eliminate any phase noise that may have been introduced by the cables of the interconnection module. At restart, each module starts a counter controlled by the clock of its partition. As this clock is common to the processors of the partition, each processor has a synchronous time reference allowing it to date (we speak of a "time stamp" in English) any event that takes place in its module, and at any time it will be possible to reconstitute the chronology of several events without ambiguity as to their respective order. An example of an application is the management of the catastrophic error signal, called CATERR in English, which is part of the out-of-band interface of the calculation modules. This is a network that is not used for calculation tasks but only for the administration of the module. These communications have no impact on module performance. These signals are distributed in the partitions in the same manner as the clock described above. When a CATERR signal is triggered in a module, it usually causes very quickly, within microseconds or less, a similar event in one or more of the adjacent modules. The time stamping of the CATERR events in each module allows the BMCs to exchange these measurements and to determine the source of the problem, which facilitates the diagnosis and precise localization of the faulty module. [0019] Infrastructure Considerations The infrastructure is defined as the set of hardware components that provide the physical links between the modules. To ensure the reliability of this infrastructure, it is advantageous to use a maximum of passive components, whose mean time between failures, called MTBF for "Mean Time Between Failures" in English, exceeds by several orders of magnitude the MTBF of the motherboards of the modules.
Thus, the number of failure cases created by this infrastructure can be neglected in the evaluation of system availability. In the system described, the infrastructure consists of an interconnection module, also called a backplane, which distributes the high-speed link interfaces, called XQPI interfaces, between the eight modules, and the local out-of-band Ethernet interfaces between the eight modules and the MSM switch. Since the backplane has to support signals at very high speed, 14 Gb/s today and even more in a few years, a realization as a traditional printed circuit was not retained. Indeed, even using high-end materials, such as Panasonic Megtron 6, insertion losses exceed 20 dB at this data rate for the links between the modules physically furthest from each other. Another disadvantage of a backplane board relates to mechanical reliability. Given the weight of a solid module and the manufacturing tolerances of a chassis and its guides in the cabinet, the mechanical rigidity of such a backplane exerts very high mechanical stresses on the high-speed connectors of the XQPI interface. These stresses are a source of mechanical failure of the infrastructure. [0020] To avoid this problem, the backplane of the multi-modular system of the invention uses links on copper cables terminated by floating connector mounts that tolerate up to 1 mm of mechanical misalignment in x and y without exerting stress on the high-speed contacts. In addition, insertion losses per unit length are about ten times lower than in conventional PCB material, and the number of connectors and vias is reduced to a minimum. These elements contribute to improving the reliability of the infrastructure and therefore to the goal of high availability of the system. The floating aspect is obtained, for example, by the use of guides. FIG.
3 shows that a calculation module 3010 comprises a rigid rectangular base 3020 to which the components of the module are attached, and which carries a male or female module connector 3040 capable of being connected to a corresponding female or male connector 3050 of a chassis 3030 carrying the backplane, or backplane bus. The chassis 3030 is only partially shown here to illustrate the embodiment. It is clear to those skilled in the art that if the module has a male connector, then the chassis has a female connector, and vice versa. In a variant the base 3020 is an edge of a printed circuit board. FIG. 3, which shows only part of the base of the module, shows that the base comprises, on the side facing the chassis connector: a first guide 3060 and a second guide 3070 located on either side of the module connector 3040. This placement of the guides makes it possible to protect the connector, at least mechanically. The first guide and the second guide, of a first type, have substantially the same length and are: o of cylindrical shape, o fixed to the base, o terminated by a cone on their free side, o of a length of a few centimetres, at least 1 cm but less than 5 cm; a third guide 3080 of a second type which is: o of the same shape as the first guide, o substantially longer than the first guide, by about one centimetre, in practice 1 centimetre, o located on the major central axis of the rectangle formed by the base and a few centimetres from the first and second guides. FIG. 3 also shows that the chassis 3030 has orifices corresponding to the first, second and third guides. In a manner not shown, the base comprises at least a fourth guide identical to the third guide. This fourth guide is symmetric to the third relative to the minor axis of the rectangle formed by the base. The major axis of a rectangle is the perpendicular bisector of a short side. The minor axis of a rectangle is the perpendicular bisector of a long side.
In a variant, the first and second guides also have their symmetric counterparts relative to the minor axis; the base then comprises six guides. These axes are axes of geometric symmetry of the base. In general, these axes also correspond to axes of mass symmetry, that is to say axes on either side of which the mass of the device is distributed equally. If this is not the case, mass axes can also be used to distribute the guides. It is clear that the chassis has as many orifices adapted to cooperate with guides as the base comprises guides. This arrangement of guides makes it possible to perform a first positioning, or initial guidance, of the module relative to the chassis using the third guide and its symmetric counterpart, which provide a first level of guidance. This first level is a placement relative to the chassis. Then, as insertion continues, the first and second guides and their symmetric counterparts come into play, providing a second, final level of guidance while protecting the connectors. In a variant of the invention the guides are rigidly fixed to the base. The device can nevertheless be described as floating, because the conical ends of the guides allow a certain clearance. This floating aspect makes it possible to replace a module more safely while the chassis is in use. It also minimizes the risk of damage when inserting a module whose weight exceeds ten kilos and which may be located two metres from the ground, or flush with the floor, in a cabinet. [0021] This floating aspect, independently of hot replacement, also makes installation outside the factory more reliable by minimizing the risk of damage when inserting modules. Thus, this floating aspect facilitates the replacement of a faulty module after the execution of the applications has been delegated, according to the invention, to another partition. But this floating aspect is also of interest outside the application-mirroring context, for a simple installation.
In another variant of the invention the first guide 3060 and the third guide are floatingly fixed to the base. For example, FIG. 4 shows that the base 3020 comprises, on its face opposite to the third guide, two parts perpendicular to the base: a first connector support part 4010 and a second connector support part 4020. The first connector support part and the second connector support part are parallel and comprise at least one orifice 4015 allowing the passage of a shoulder screw. The diameter of at least one orifice 4015 of a connector support is slightly greater than the diameter of a shoulder 4035 of a shoulder screw 4030 used through the orifice. Thus, once fixed, the screw has play in the orifice due to the difference in diameter between the shoulder and the orifice. This difference in diameter is in the range of 1 to 2 mm. FIG. 4 also shows a connector block 4040 slidable between the connector supports. The connector block has a base on which are fixed: the first guide 3060, the second guide 3070 and the module connector 3040. The connector block also has a first wall 4050 and a second wall 4060 which are parallel, such that once the connector block is in place the walls are parallel to the support parts. FIG. 4 also shows that a wall of the connector block has a tapped orifice with a thread compatible with that of the shoulder screw 4030. Once the shoulder screw is screwed in, the connector block floats relative to the base. This variant further improves the floating aspect. The amount of float depends on the difference in diameter between orifice and shoulder. [0022] In one implementation, each support part has 4 orifices for 4 shoulder screws. The walls then comprise the 4 corresponding tapped orifices. The 4 orifices are arranged in a square. In one implementation, each support part has 3 orifices for 3 shoulder screws. The walls then comprise the 3 corresponding tapped orifices. The 3 orifices are arranged in an isosceles or equilateral triangle.
In one implementation, each support part has 2 orifices for 2 shoulder screws. The walls then comprise the 2 corresponding tapped orifices. [0023] FIG. 5, illustrating another variant of the invention, shows a module 5010, here comparable to a motherboard 5010. The base 5020 of the motherboard 5010 is an edge 5020 of the motherboard 5010. FIG. 5 shows that the base 5020 of the board 5010 comprises a connector 5040, and on each side of this connector 5040: a first guide piece 5060 in the form of a rectangular parallelepiped hollowed out with a cylindrical bore, the axis of the cylinder being perpendicular to the base 5020 of the motherboard 5010; a second guide piece 5070 identical to the first guide piece. [0024] In one variant, the opening of the cylinder on the base side is conical, allowing easier introduction of a cylindrical guide. FIG. 5 also shows that the base of the motherboard includes a guide 5080 similar to the guide 3080 of the module base. FIG. 5 also shows a chassis 5030 on which a connector block such as that described in FIG. 4 is fixed. FIG. 5 therefore keeps references identical to those of FIG. 4 for identical functions. The chassis 5030 has, on its face opposite the motherboard, two parts perpendicular to the motherboard once the motherboard is in place: a first connector support part 4010 and a second connector support part 4020. The first connector support part and the second connector support part are parallel and comprise at least one orifice 4015 allowing the passage of a shoulder screw. The diameter of at least one orifice 4015 of a connector support is slightly greater than the diameter of a shoulder 4035 of a shoulder screw 4030 used through the orifice. Thus, once fixed, the screw has play in the orifice due to the difference in diameter between the shoulder and the orifice. This difference in diameter is in the range of 1 to 2 mm. Thus equipped, the chassis is able to receive a connector block which will then be floating.
Figures 3 to 5 illustrate several embodiments of a floating connection between a module and a chassis. [0025] The MSM switch of the invention, which includes some active elements, is part of the infrastructure. It has been seen previously that its power supplies are highly redundant. In addition, the very low power dissipation, of the order of 2 to 3 watts, and the low number of components make it possible to predict an MTBF of several million hours for the MSM module. Finally, it should be noted that the operation of the MSM switch is monitored by the eight BMCs of the modules and that, if an MSM failure is detected, the hot replacement of this module does not interrupt the operation of the multi-modular system. It can therefore be said that the impact of the MSM on the availability rate of the system is negligible, as long as the physical exchange of the module occurs within the usual delays. We now turn to the description of the various software components used by the invention. [0026] Octo-module configuration The eight-module configuration described above is considered to support a 1+1 redundant system, where each processor in one partition has a redundant processor in another partition. This configuration is the simplest to describe, but for security reasons a 1+2 or 1+3 redundant configuration, where each processor of a partition has respectively two or three redundant processors in respectively two or three other partitions, can be used. During a configuration step (E1) of the plurality of modules into a plurality of partitions, three modules are allocated to a partition A, the main partition. A partition B, the secondary partition, is likewise constituted with three modules. The remaining two modules of the eight-module configuration are assigned to a partition C, the control partition, one of which holds the execution logs of partitions A and B.
The second module in partition C can temporarily replace any other failed module, according to the availability rate required. The C partition can optionally run non-critical applications, provided that the priorities of these applications are well below the priorities of the software modules that provide the high-availability functions for the applications running in the A and B partitions. The critical application load is then 37.5% of the installed processing power, i.e. three modules out of a total of eight. For a quad-module configuration, partition A and partition B each contain a single module and partition C contains two modules. In this case the load of critical applications represents only 25% of the installed processing power. Exploitation of the architecture This invention only uses OSes running on a single core. Indeed, although it is known how to build OS kernels running on multicore processors, the use of such kernels in this invention could lead to insoluble process synchronization problems. This does not mean that the physical processor is a single-core processor. The use of multicore processors is only constrained by the allocation of the physical resources provided by the hypervision layer, according to which a given OS receives one core and only one. Nothing prevents the same core from being assigned to several distinct OSes; in that case, only performance or response times can be affected. In the description that follows we assume the simple rule: 1 core = 1 OS. Of course, under each OS, users are free to run as many simultaneous applications as they want. The computing load generated by these applications is multiplexed in a conventional manner by the scheduler of the OS kernel. The higher the load, the lower the performance. [0027] For process scheduling, each OS manages a queue of processes that are ready to run, called "ready" processes.
Only the first process of this queue receives the processor and is therefore in a sub-state called "running". When the running process encounters a blocking synchronization semaphore operation, called P-Op, it leaves the ready-process queue and goes into a waiting state, called "waiting", until another process or an I/O interrupt handler unblocks it by a V-Op operation on the same semaphore. At this point, the process is re-entered into the ready-process queue according to priority rules that depend on the OS and its configuration. We assume here a queuing mode called "insertion by priority". This means that the queue in question consists of several sub-lists, one sub-list per priority level, the sub-lists being linked and ordered among themselves according to their priority. Insertion by priority means that the new element is inserted at the end of the sub-list associated with the priority of this new element. If this sub-list does not exist, the new entry is inserted at the end of the nearest existing sub-list of higher priority, or at the head of the queue if no such sub-list exists. This type of operation, performed regularly by the kernel, is often called a dispatching operation, or dispatch for short. With such a process-synchronization kernel, which has become standard today, the execution on a core consists of a series of instruction sequences separated by dispatching operations executed by the process-synchronization module, which is unique within the OS. If two identical OSes, executing the same processes (same code, same data, on identical processors), run in parallel while synchronizing at the time of the dispatching operations, the states of the two OSes, and consequently the results obtained at the output of both OSes, will be strictly identical from a logical point of view.
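The "insertion by priority" queue described above can be sketched as follows. This is an illustrative model, not the patent's kernel code: appending a readied process at the end of the sub-list for its priority level is equivalent to a stable sorted insert on the pair (priority, arrival order), and the dispatch operation simply hands the processor to the head of the queue.

```python
import bisect
import itertools

class ReadyQueue:
    """Ready-process queue with 'insertion by priority': FIFO order within
    each priority level, levels ordered from highest priority (lowest
    number) to lowest. Illustrative sketch, not the patented kernel."""
    def __init__(self):
        self._seq = itertools.count()   # arrival counter: FIFO tie-breaker
        self._q = []                    # kept sorted as (priority, seq, proc)

    def insert(self, proc, priority):
        # Sorted insert on (priority, arrival order) places the new entry
        # at the end of the sub-list for its priority level, creating that
        # sub-list implicitly if it did not exist.
        bisect.insort(self._q, (priority, next(self._seq), proc))

    def dispatch(self):
        # The first process of the queue receives the processor ("running").
        return self._q.pop(0)[2] if self._q else None

q = ReadyQueue()
q.insert("editor", priority=2)
q.insert("logger", priority=3)
q.insert("watchdog", priority=1)   # highest priority: dispatched first
q.insert("backup", priority=2)     # joins the end of the priority-2 sub-list
print([q.dispatch() for _ in range(4)])  # → ['watchdog', 'editor', 'backup', 'logger']
```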
On the other hand, execution from the physical point of view may differ from execution from the logical point of view. Execution from the physical point of view In what follows, we describe how the schedulers of two cores executing in parallel on partition A and on partition B can remain in perfect synchronization. When a program executes on a conventional computer, the logic, in the mathematical sense of the term, of this program must be executed deterministically. This means that the programmer expects that two successive executions of the same program on the same data set will always lead to exactly the same results. In reality this is only an appearance because, in the background, between two logical executions, various events can interleave in the middle of the program and completely modify its execution timing. The main events that can occur are: interrupts; dispatching operations allocating the processor to another task, or "thread", following a V-Op waking up a higher-priority thread; recoverable errors when reading central memory or during data transfers on the I/O buses. This list is not exhaustive. In all cases these management operations (scheduling, errors, etc.) are invisible to the program, except for its timing. It is therefore the execution timing that introduces the notion of execution from the physical point of view of the program, by adding the finest temporal component to the description. If no precautions are taken, launching several programs in parallel under an OS does not guarantee the order in which the operations will be carried out. A time lag can therefore appear very quickly between a program running on partition A and the same program running on partition B, even if they started on the same clock.
The present invention shows that a proper organization of the thread synchronization points in a multi-modular system makes it possible to ensure that the physical executions on two partitions remain synchronous within a time interval determined in advance, for example 50 ms. Since interrupts can usually be followed by the launch of a thread, and this is done through the OS kernel dispatcher, it is possible to limit the desynchronization between partitions A and B by artificially introducing interrupts generated by a hardware timer. Such a timer is generally available among the chipset resources, and the hypervisor can initialize it to create a dispatch at a given frequency, for example 20 Hz if it is desired that the desynchronization not exceed 50 ms. Keep in mind that on a modern processor the context change time is of the order of 1 µs. Thanks to this property, forcing a dispatch every 50 ms has no noticeable effect on performance. It would be quite different if we wanted to synchronize the threads every 5 or 10 µs. [0028] Synchronization between 2 VMs To distinguish the case of VMs running in the same partition A or B from that of two VMs, one of which runs on A and the other on B, the latter situation is called "non-coherent VMs", as a reminder that the memories of A and B are non-coherent in the sense of the XQPI inter-module interface. In order to better understand the advantages provided by the invention, it is interesting to examine some cases of sudden failures and the associated reactions of the system according to the invention. [0029] The VRM converters, for "Voltage Regulator Module", are the point-of-load (POL) converters for processors or DRAMs. When such a VRM fails, the associated circuits stop their activity after a few microseconds because at least one voltage presence detector switches the PWRGOOD signal, which causes the logic to stop and the CATERR signal to be raised.
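The overhead argument for the forced-dispatch timer can be checked with a short back-of-the-envelope calculation: the performance cost is the ratio of context-change time (of the order of a microsecond) to the forced synchronization period. This is an illustrative sketch, not part of the patented method; the function name and figures are taken from the surrounding text.

```python
def dispatch_overhead(sync_interval_s, context_switch_s=1e-6):
    """Fraction of processor time consumed by one forced context change
    per synchronization interval (assumed ~1 microsecond per switch)."""
    return context_switch_s / sync_interval_s

# Forced dispatch at 20 Hz (every 50 ms): overhead is a few hundredths
# of a percent, i.e. no noticeable effect on performance.
print(dispatch_overhead(50e-3))
# Synchronizing every 10 microseconds instead: overhead reaches ~10 %,
# which is why such a fine synchronization grain is not pursued.
print(dispatch_overhead(10e-6))
```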
The CATERR signal is transmitted by the interconnection module to all the modules of the partition. On reception, CATERR blocks all connected processors, and the service processor, the BMC, initiates procedures to handle the failure. [0030] The stopping of all transactions with the synchronization functions is detected at the latest within 50 ms. Partition B then finishes executing its threads up to this synchronization point. Then, since the failure of partition A is noted by the synchronization function, the latter releases the threads of partition B by simulating a V-Op, which allows these threads to continue their execution. A fatal error in memory is a fairly common type of error because the number of transistors that constitute the central memory can exceed by two orders of magnitude that of the processors and caches. In Intel terminology, the management architecture for these errors is called MCA, for Machine Check Architecture. An error is said to be of hard type if a memory cell is permanently stuck at 0 or at 1. An error is said to be of soft type if the contents of the cell have been altered but the cell itself remains functional. [0031] In the MCA architecture, a fatal error in memory leads, with no alternative, to a memory dump, called "crashdump", of the OS or the hypervisor. The restart is necessarily very long, several minutes in general. Using the invention, the crashdump function is preceded by the failover function, which provides in less than 50 ms the switchover between partition A and partition B. Since a large number of fatal memory errors are not hard errors but soft errors, it is possible, following the execution failover, to restart partition A, then progressively migrate the OS and applications from partition B to partition A, until the synchronization functions of partition C can actually be reactivated.
At this point, the system is again 1+1 redundant and, as it is completely symmetrical, it is not necessary to restore partition A to main mode. It is enough to declare that partition B has become the main partition and that partition A has become the mirror partition. The bandwidth of the XQPI bus is large, more than 10 GB/s with current technologies; migrating a central memory of 1 TB will therefore mechanically take at least 1000 GB / 10 GB/s = 100 seconds. This transfer is advantageously controlled by a software module residing in partition C. This is the order of magnitude of the time during which the system will not be completely redundant, but it will have ensured the resumption of service in less than 50 ms. The critical failure rate is then related to the probability of a second fatal error during this 100 s interval. For example, assuming a technology leading to a fatal error rate of 1 per 10^18 bits read, and a memory read at a rate of 1000 Gb/s, the system would have an average interval between two failures of the order of 10^6 s, i.e. almost 300 hours. In fact, one should be wary of this type of statistic because the memory reading process does not have the stationarity characteristics of classical random variables. Fatal memory errors may be entirely due to a single faulty cell that happens to be accessed very frequently by a particular OS or application. That said, with the above assumptions, and knowing that on average there is a fatal error every 10^6 s, the critical failure rate of the system is given by the probability of occurrence of a second fatal error within 100 s, i.e. 1/10,000. The MTBF would then be of the order of 3 million hours, or 340 years. If this failure rate is deemed too high, it will be necessary to switch to a 1+2 or 1+3 redundancy scheme, that is to say to keep more than two machines synchronized. A fatal disk I/O error is also a fairly common type of error because there are many causes of error on storage devices.
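The figures above can be worked through explicitly: migration time of the 1 TB memory, mean interval between fatal errors under the stated (assumption-laden) error rate, probability of a second fatal error during the migration window, and the resulting system MTBF. This is a sketch of the arithmetic in the text, not a reliability model.

```python
migration_s = 1000e9 / 10e9             # 1 TB at 10 GB/s -> 100 s without full redundancy
error_rate = 1e-18                      # assumed fatal errors per bit read
read_rate_bps = 1000e9                  # memory read at 1000 Gb/s (as in the text)

# Mean interval between fatal errors: 1e6 s, i.e. almost 300 hours.
mean_error_interval_s = 1 / (error_rate * read_rate_bps)

# Probability of a second fatal error during the 100 s migration window.
p_second_error = migration_s / mean_error_interval_s        # 1/10,000

# Resulting system MTBF: of the order of 3 million hours.
mtbf_hours = mean_error_interval_s / p_second_error / 3600

print(migration_s, mean_error_interval_s, p_second_error, round(mtbf_hours))
```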
The storage subsystem of the system according to the invention is assumed to be in RAID1 mode, i.e. any file written on a disk attached to partition A has a mirror file on a disk attached to partition B. [0032] A fatal error in this area may be due to a disk block whose contents have been incoherently overwritten by some failure. The error is detected by a bad error-correction code at the time this block is read, and this error is returned to the input/output manager by the messages of the PCIe bus corresponding to the channel of the faulty disk. [0033] In this case, the entire file is inaccessible to the program that is trying to perform I/O on that disk, and the OS usually terminates the application by placing it in a suspended state. Since the mirror program, running in partition B, did not receive the error notification directly, the synchronization function is able to start the file recovery after having, for a time, handed control to the mirror program by simulating the synchronization V-Ops. The algorithms for this hot recovery are known and used in all RAID controllers on the market. Here, the firmware functions of the RAID controllers are replaced by one or more software functions that execute in partition C. These functions implement event logs that ensure the integrity of the file systems. The advantage of the invention is, here again, to be able to guarantee that the program resumes its normal activity in less than 50 ms.
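The RAID1-style recovery described above can be sketched minimally: every block is written to both copies, and on a read error from the primary copy the mirror copy is served and the primary block is rewritten, so the program resumes without seeing the failure. Class and method names are illustrative assumptions, not the patent's software.

```python
class MirroredStore:
    """Toy sketch of RAID1-style mirroring with hot recovery on read error."""
    def __init__(self):
        self.primary, self.mirror = {}, {}

    def write(self, block, data):
        self.primary[block] = data
        self.mirror[block] = data          # mirror file on the other partition's disk

    def corrupt(self, block):
        self.primary[block] = None         # simulate a bad ECC detected on read

    def read(self, block):
        data = self.primary.get(block)
        if data is None:                   # read error: recover from the mirror
            data = self.mirror[block]
            self.primary[block] = data     # hot repair of the primary copy
        return data

s = MirroredStore()
s.write("blk0", b"payload")
s.corrupt("blk0")
print(s.read("blk0"), s.primary["blk0"])   # both copies are b'payload' after recovery
```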
Claims (17) [0001] 1. A method of synchronized execution of an application in a high-availability environment comprising a plurality of calculation modules interconnected by a very high speed network, characterized in that it comprises the following steps: - configuration (2000) of the plurality of modules into a plurality of partitions including at least: o two execution partitions having nodes identical in number and characteristics, a main execution partition and a secondary execution partition; o a control partition comprising at least one module; - execution (2100) of the application on each execution partition, the input-output processed by the main partition being transmitted to the secondary execution partition via the control partition, which logs these I/O as distributed transactions between the main execution partition and the secondary execution partition; - synchronization (2200) of the executions on the main partition and on the secondary partition by numbering the microprocessor context changes, a time difference corresponding to the difference between the current contexts on the main and secondary execution partitions having to remain less than a given value; - transmission (2300) of a catastrophic error signal to the control partition, said signal being characteristic of a failure in a module of an execution partition; - continuation (2500) of the execution by passing into a so-called degraded mode, the execution continuing on a single partition, and, in the case of a failure of a module of the main execution partition, a failover operation (2400), the secondary execution partition becoming the new main execution partition. [0002] 2. Synchronized execution method according to claim 1, characterized in that an interruption is forced at a predetermined frequency, to force a microprocessor context change and thus proceed to a synchronization (2200). [0003] 3.
Method according to claim 2, characterized in that the predetermined frequency is of the order of 20 Hz, the desynchronization thus being limited to 50 ms. [0004] 4. Method according to one of the preceding claims, characterized in that the application is virtualized via a hypervisor. [0005] 5. Method according to claim 4, characterized in that the virtual machines are identical on the main execution partition and on the secondary execution partition, the virtual machines running on a single processor core. [0006] 6. Method according to one of the preceding claims, characterized in that the step of configuration (2000) of the plurality of modules into a plurality of partitions is carried out dynamically, without stopping the environment's service. [0007] 7. Method according to one of the preceding claims, characterized in that the apparent unavailability of the environment is equal to the desynchronization between the executions of the application on the main and secondary partitions, the unavailability therefore being limited by the synchronization (2200) of the executions. [0008] 8. Digital storage device comprising a file corresponding to instruction codes implementing the method according to one of the preceding claims. [0009] 9. Device implementing the method according to one of claims 1 to 7. [0010] 10. Device according to one of claims 8 or 9, comprising a plurality of modules inserted into a chassis, characterized in that a base of a module comprises insertion guides fixed to the base, these guides being of two types: a first type, to ensure final guidance and protection of a connector of the base; a second type, having a cylindrical shape terminated by a cone, to provide initial guidance; each guide having a corresponding means in the chassis. [0011] 11. Device according to claim 10, characterized in that the first type has a cylindrical shape terminated by a cone, its length being less than that of the second type by approximately one centimetre. [0012] 12.
Device according to claim 10, characterized in that the first type is a recessed piece adapted to cooperate with a cylindrical shape terminated by a cone. [0013] 13. Device according to one of claims 10 to 12, characterized in that at least one guide of the second type is rigidly fixed to the base. [0014] 14. Device according to one of claims 11 or 13, characterized in that at least one guide of the first type is floatingly fixed to the base. [0015] 15. Device according to one of claims 12 or 13, characterized in that at least one cylindrical guide cooperating with the guide of the first type is floatingly fixed to the chassis. [0016] 16. Device according to one of claims 10 to 15, characterized in that the base comprises four guides of the first type and two guides of the second type. [0017] 17. Device according to one of claims 10 to 16, characterized in that the guides are distributed symmetrically with respect to the axes of symmetry of the base.